Once I shared my first article about “Getting Started with Polars” I started thinking about something exciting: comparing the speed of Polars and Pandas. This curiosity was sparked by all the buzz around the brand-new Pandas 2.0 release, promising lightning-fast performance. Pandas 2.0 was announced to come packed with cool features, including the addition of Apache Arrow (pyarrow) as its backing memory format. The big perk of Apache Arrow is that it makes operations speedier and more memory-friendly. Naturally, this got me wondering: how does Pandas 2.0 measure up against Polars? Let’s dive in and find out!

Warning

Keep in mind: occasionally, I refer to some performance difference figures. Even though I repeated these experiments 100 times each, because computations can have a bit of randomness, there might be slight variations in the exact numbers.

Setup

Code

import pandas as pd
import polars as pl
import polars.selectors as cs

import numpy as np
import timeit
import random

import plotly.io as pio
import plotly.express as px

pio.templates.default = "presentation"

print(pd.__version__)
print(pl.__version__)

2.0.3
0.18.11

Note

Throughout this article, we’ll be working with the latest versions of the packages: Pandas 2.0.3 and Polars 0.18.11.

Generate the synthetic data

For the purpose of benchmarking, we’re going to create our own dataset. To ensure simplicity and uniformity, we’ll generate a dataframe containing columns like name, year of birth, and city of residence – pretty straightforward stuff. We’re aiming for a dataset of 10,000 rows. While we could certainly run this benchmark on a dataframe with more than a million rows, our intention is to provide a realistic example. The code we’ll provide allows you to effortlessly explore various scenarios according to your requirements. Feel free to use it or modify it if you want to perform your own test.

Code

def create_sample_dataframe(n_rows: int, name: str, seed: int = 42) -> pl.DataFrame:
    """
    Create a sample Polars DataFrame with the specified number of rows.

    This function generates random data for several columns and creates a Polars
    DataFrame with the specified number of rows. The generated data includes information
    about names, birth years, cities, zip codes, incomes, marital status, number of children,
    and car brands. The resulting DataFrame is written to CSV and Parquet files named
    'sample.csv' and '{name}.parquet', respectively.

    Parameters
    ----------
    n_rows : int
        The number of rows to generate in the sample DataFrame.
    name : str
        The name to be used when writing the Parquet file.
    seed : int, optional
        The seed to use when generating random data. Default is 42.

    Returns
    -------
    pl.DataFrame
        The first 5 rows of the generated sample DataFrame.
    """
    random.seed(seed)
    np.random.seed(seed)

    data = {
        "name": np.random.choice(["Alice", "Bob", "Charlie", "David", "Eva"], n_rows),
        "born": np.random.randint(1950, 1990, size=n_rows),
        "city": np.random.choice(
            ["Shanghai", "San Francisco", "London", "Munich", "Mumbai"], n_rows
        ),
        "zip_code": np.random.randint(10000, 99999, size=n_rows),
        "income": np.random.normal(50000, 10000, size=n_rows),
        "is_married": np.random.choice([True, False], n_rows),
        "children": np.random.randint(0, 5, size=n_rows),
        "car": np.random.choice(["Ford", "BMW", "Toyota", "Bentley", "Mini"], n_rows),
    }
    temp_df = pl.DataFrame(data)
    temp_df.write_csv("sample.csv")
    temp_df.write_parquet(f"{name}.parquet")
    return temp_df.head()


create_sample_dataframe(n_rows=10_000, name="sample", seed=1)

shape: (5, 8)

name	born	city	zip_code	income	is_married	children	car
str	i32	str	i32	f64	bool	i32	str
"David"	1953	"Shanghai"	35716	69693.044244	false	4	"Bentley"
"Eva"	1984	"London"	30931	30547.027839	true	1	"Toyota"
"Alice"	1979	"San Francisco"	85967	63489.882712	false	1	"Mini"
"Bob"	1969	"San Francisco"	64527	56030.692112	false	1	"Toyota"
"David"	1979	"Munich"	53316	51118.005552	true	3	"Ford"

Define helper class

In order to streamline our code, we’ll create a convenient helper class featuring two essential methods: get_time and get_figure. The get_time method leverages the timeit.repeat function to accurately gauge the execution time of our code. In this context, we’ll set the parameters number = 1 and repeat = 100. This configuration runs the operations 100 times, allowing us to derive key statistics like the mean, median, and standard deviation.

On the other hand, the get_figure method takes these gathered results and generates a visually appealing image using Plotly.

To bring it all together and keep things simple, we’ve designed the run_test function. This function orchestrates the entire process, seamlessly merging the steps outlined above.

Code

class Timeit:
    """
    A class for measuring the execution time of multiple functions.
    This class provides methods for measuring the execution time of
    multiple functions using the `timeit` module. The functions to be
    tested are passed to the constructor as a dictionary, where the
    keys are the names of the functions and the values are the functions
    themselves. The `number` and `repeat` parameters control how many times
    each function is executed and how many times the timing is repeated,
    respectively.

    The `get_time` method measures the execution time of each function and
    stores the results in a dictionary, where the keys are the names of
    the functions and the values are lists of execution times. The `get_figure`
    method generates a box plot of the execution times using the `plotly.express.box`
    function.

    Parameters
    ----------
    func_dict : dict
        A dictionary of functions to test, where the keys are the names of the
        functions and the values are the functions themselves.
    number : int, optional
        The number of times to execute each function in each timing run (default is 1).
    repeat : int, optional
        The number of times to repeat each timing run (default is 20).

    Attributes
    ----------
    results : dict or None
        A dictionary of execution times for each function, where the keys are the names
        of the functions and the values are lists of execution times. This attribute is
        `None` until `get_time` is called.

    Methods
    -------
    get_time()
        Measure the execution time of each function and store the results in the `results` attribute.
    get_figure()
        Generate a box plot of the execution times using `plotly.express.box`.
    """

    def __init__(self, func_dict, number=1, repeat=100):
        self.func_dict = func_dict
        self.number = number
        self.repeat = repeat
        self.results = None

    def get_time(self):
        """
        Measure the execution time of each function and store the results.
        This method uses the `timeit.repeat` function to measure the execution time of each function
        in `func_dict`. The results are stored in a dictionary, where the keys are the names of
        the functions and the values are lists of execution times. This dictionary is also stored
        in the `results` attribute of the `Timeit` object.

        Returns
        -------
        dict
            A dictionary of execution times for each function.
        """
        results = {}
        for k, v in self.func_dict.items():
            results[k] = timeit.repeat(v, number=self.number, repeat=self.repeat)
        self.results = results
        return results

    def get_figure(self):
        """
        Generate a box plot of the execution times.
        This method uses `plotly.express.box` to generate a box plot of the execution times
        stored in `results`.If `results` is `None`, this method raises a `ValueError`.

        Returns
        -------
        plotly.graph_objs.Figure
            A box plot figure object.

        Raises
        ------
        ValueError
            If no results are available (i.e., if `get_time` has not been called).
        """
        if self.results is None:
            raise ValueError("No results available. Please run get_time first.")
        df = pd.DataFrame(self.results)
        mean_times = df.median() * 1000
        title = f'Median Execution Times: <br><sup>{", ".join([f"{k}={v:.6f}ms" for k,v in mean_times.items()])}</sup>'
        return px.box(
            df,
            points="all",
            labels={"value": "time (sec)", "variable": ""},
            title=title,
            width=600,
            height=500,
        )


def run_test(test: dict, name: str):
    """
    Run a timing test on a dictionary of functions using the Timeit class and
    return the results as a DataFrame.

    This function takes a dictionary of functions as input, where the keys are
    the names of the functions and the values are the functions themselves. It creates
    a Timeit object with this dictionary and uses it to measure the execution time of
    each function. The results are then displayed as a box plot using the `get_figure`
    method of the Timeit object and returned as a Pandas DataFrame with columns renamed
    according to the `name` argument.

    Parameters
    ----------
    test : dict
        A dictionary of functions to test, where the keys are the names of the functions
        and the values are the functions themselves.
    name : str
        A string used to rename the columns of the returned DataFrame.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the results of the timing test, with columns renamed
        according to the `name` argument.
    """

    # Create a Timeit object
    t = Timeit(test)
    # Measure the execution time of the functions
    results = t.get_time()
    # Generate a box plot of the execution times
    fig = t.get_figure()
    fig.show()

    return pd.DataFrame(results).rename(
        columns={"Pandas": f"{name}_pandas", "Polars": f"{name}_polars"}
    )

Reading in data From a CSV File and parquet

CSV

Reading CSV files from disk is a task that data scientists often find themselves doing. Now, let’s see how these two libraries compare for this particular job. To maximize the blazing-fast data handling capabilities of PyArrow, we’ll equip Pandas with the engine="pyarrow" and dtype_backend="pyarrow" arguments. Let’s see how these choices shape the performance!

Code

test_dict = {
    "Pandas": lambda: pd.read_csv(
        "sample.csv", engine="pyarrow", dtype_backend="pyarrow"
    ),
    "Polars": lambda: pl.read_csv("sample.csv"),
}
read_csv = run_test(test_dict, "Read csv")

Figure 1: Reading in data From a CSV File

For the sake of comparison, we’ll also demonstrate the timeit function invoked using Jupyter cell magic. You’ll notice that the numbers generated this way are quite closely aligned with ours.

Code

%%timeit
pd.read_csv("sample.csv", engine="pyarrow", dtype_backend="pyarrow")

6.33 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Code

%%timeit
pl.read_csv("sample.csv")

3.01 ms ± 456 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

parquet

Now, let’s read the data in Parquet format.

Code

test_dict = {
    "Pandas": lambda: pd.read_parquet(
        "sample.parquet", engine="pyarrow", dtype_backend="pyarrow"
    ),
    "Polars": lambda: pl.read_parquet("sample.parquet"),
}
read_parquet = run_test(test_dict, "Read parquet")

Figure 2: Reading in data From a parquet File

Note

Polars unquestionably wins this round, it can boast a speed advantage of 2 to 4 times over Pandas.

Selecting Columns

Alright, let’s spice things up a bit and select some columns to see which library races ahead in terms of speed!

Files awaiting in-memory reading

A clever approach to conserve memory and enhance speed involves reading only the columns essential for operations. Consider a scenario where we’re interested in displaying just the names from this dataset. The big question now: how do these libraries measure up in terms of speed? Let’s find out!

Code

test_dict = {
    "Pandas": lambda: pd.read_csv(
        "sample.csv", engine="pyarrow", dtype_backend="pyarrow", usecols=["name"]
    ),
    "Polars": lambda: pl.scan_csv("sample.csv").select(pl.col("name")).collect(),
}
select_col_not_in_memory = run_test(test_dict, "Select column (not in memory)")

Figure 3: Selecting Columns from a File Not Yet in Memory

File is in memory

As anticipated, Polars continues to showcase its swiftness. It’s worth highlighting the usage of the lazy and collect methods in Polars. These nifty tools grant us access to the library’s clever query optimization techniques, which play a pivotal role in significantly enhancing performance. OK, one step further: suppose our files are already loaded into memory. Would there still be a distinction in performance under this circumstance?

Code

df_pandas = pd.read_csv("sample.csv", engine="pyarrow", dtype_backend="pyarrow")
df_polars = pl.read_csv("sample.csv")

Code

test_dict = {
    "Pandas": lambda: df_pandas.loc[:, "name"],
    "Polars": lambda: df_polars.lazy().select(pl.col("name")).collect(),
}
select_col_in_memory = run_test(test_dict, "Select column")

Figure 4: Selecting Columns from a File Already in Memory

Note

While Polars showed a significant speed advantage for tasks involving pre-read files, both libraries perform similarly when the files are already in memory.

Filtering Rows

Now, let’s explore the scenario where we filter our dataset based on one or more column values

Based on one condition

For our simple scenario, we’ll be narrowing down our focus to filter data based on individuals with the name “David”.

Code

test_dict = {
    "Pandas": lambda: (df_pandas.query("name=='David'")),
    "Polars": lambda: (df_polars.lazy().filter((pl.col("name") == "David")).collect()),
}
filter_row_one_condition = run_test(test_dict, "Filter (simple)")

Figure 5: Filtering Rows Based on One Condition

Based on multiple conditions

Now, for a more intricate challenge, we’re going to dive into querying the data to extract individuals who meet specific criteria: those named David, born after 1980, residing in a city other than London, married, and with three children.

Code

test_dict = {
    "Pandas": lambda: (
        df_pandas.query(
            "name=='David' and born>1980 and city != 'London' or is_married == True and children >= 3"
        )
    ),
    "Polars": lambda: (
        df_polars.lazy()
        .filter(
            (pl.col("name") == "David")
            & (pl.col("born") > 1980)
            & (pl.col("city") != "London")
            | (pl.col("is_married") == True) & (pl.col("children") >= 3)
        )
        .collect()
    ),
}
filter_row_multiple_condition = run_test(test_dict, "Filter (complex)")

Figure 6: Filtering Rows Based on Multiple Condition

Note

Both libraries tackled this challenge quite well, yet Pandas struggled to keep pace with Polars. It’s intriguing to observe that while Pandas required nearly twice the time for the more intricate task, Polars managed to complete it in almost the same amount of time. Parallelization in action.

Performing operations on columns

Now, let’s roll up our sleeves and dive into performing some operations on the columns.

Single operation

As a single operation, we’ll simply calculate the century in which these individuals were born.

Code

test_dict = {
    "Pandas": lambda: (df_pandas.assign(born=lambda df: df.born.div(100).round())),
    "Polars": lambda: (
        df_polars.lazy().with_columns((pl.col("born") / 100).round()).collect()
    ),
}
operate_one_column = run_test(test_dict, "Operate (one column)")

Figure 7: Performing a Singme Operation on a Column

Multiple operations

Let’s also explore what happens when performing multiple operations on the columns. We’ll mix things up with some string operations, mapping, and math calculations to see how these libraries handle it!

Code

test_dict = {
    "Pandas": lambda: (
        df_pandas.assign(
            born=lambda df: df.born.div(100).round(),
            name=lambda df: df.name.str.lower(),
            city=lambda df: df.city.str.upper(),
            zip_code=lambda df: df.zip_code.mul(2),
            income=lambda df: df.income.div(10).astype("str"),
            is_married=lambda df: df.is_married.map({False: 0, True: 1}),
            children=lambda df: df.children.astype("bool"),
            car=lambda df: df.car.str[0],
        )
    ),
    "Polars": lambda: (
        df_polars.lazy()
        .with_columns(
            [
                (pl.col("born") / 100).round(),
                pl.col("name").str.to_lowercase(),
                pl.col("city").str.to_uppercase(),
                pl.col("zip_code") * 2,
                (pl.col("income") / 10).cast(pl.Utf8),
                pl.col("is_married").map_dict({False: 0, True: 1}),
                pl.col("children").cast(pl.Boolean),
                pl.col("car").str.slice(0, length=1),
            ]
        )
        .collect()
    ),
}
operate_multiple_column = run_test(test_dict, "Operate (more columns)")

Figure 8: Perfrming a Multiple Operation on Columns

Note

Once again, Polars takes the lead. While both libraries required more time for the task involving multiple operations, Polars demonstrated superior scalability in this scenario.

Concatenating Data

Now, let’s turn our attention to concatenating two datasets. Let the merging begin!

Code

df_pandas2 = df_pandas.copy(deep=True)
df_polars2 = df_polars.clone()

Code

test_dict = {
    "Pandas": lambda: pd.concat([df_pandas, df_pandas2], axis=0),
    "Polars": lambda: pl.concat([df_polars, df_polars2], how="vertical"),
}
concatenate = run_test(test_dict, "Concatenate")

Figure 9: Concatenating Dataframes

Note

Once more, Polars shines with a remarkable speed advantage.

Aggregation

Simple

Time to shift our attention to aggregation. First up, a simple task: let’s calculate the mean income based on names. Then, for a bit more complexity, we’ll dive into computing statistics involving the income, children, and car columns. Things are about to get interesting!

Code

test_dict = {
    "Pandas": lambda: (df_pandas.groupby("name").income.mean()),
    "Polars": lambda: (df_polars.lazy().groupby("name").mean().collect()),
}
aggregate_simple = run_test(test_dict, "Aggregate (simple)")

Figure 10: Performing a simple aggregation

More complex

Code

test_dict = {
    "Pandas": lambda: (
        df_pandas.groupby(["name", "car", "is_married"]).agg(
            born_min=("born", min),
            born_max=("born", max),
            income_mean=("income", np.mean),
            income_median=("income", np.median),
            children_mean=("children", np.mean),
            car_count=("car", "count"),
        )
    ),
    "Polars": lambda: (
        df_polars.lazy()
        .groupby(["name", "car", "is_married"])
        .agg(
            [
                pl.col("born").min().alias("born_min"),
                pl.col("born").max().alias("born_max"),
                pl.col("income").mean().alias("income_mean"),
                pl.col("income").median().alias("income_median"),
                pl.col("children").mean().alias("children_mean"),
                pl.col("car").count().alias("car_count"),
            ]
        )
        .collect()
    ),
}
aggregate_complex = run_test(test_dict, "Aggregate (complex)")

Figure 11: Performing a complex aggregation

Note

While Pandas showcased noteworthy speed for the simple aggregation, the more intricate task exposed significant disparities between the two libraries. Polars took a commanding lead in this scenario, presenting a considerably faster performance compared to Pandas.

Whole workflow

We’re going to tackle this in two versions. First, we’ll use pl.read_csv for Polars, assessing the time it takes to complete the entire workflow on a DataFrame already in memory. Then, for the second version, we’ll employ pl.scan_csv for Polars. This nifty function lets us lazily read data from a CSV file or multiple files using glob patterns.

The cool part? With pl.scan_csv, the query optimizer can push down predicates and projections to the scan level. This nifty move has the potential to cut down on memory overhead. Let’s dive into both versions and see how they stack up!

Using `pl.read_csv`

Code

test_dict = {
    "Pandas": lambda: (
        df_pandas.loc[:, lambda df: ~df.columns.isin(["is_married"])]
        .query("name=='David' and born>1980 and city != 'London' and children >= 3")
        .assign(
            born=lambda df: df.born.div(100).round(),
            name=lambda df: df.name.str.lower(),
            city=lambda df: df.city.str.upper(),
            zip_code=lambda df: df.zip_code.mul(2),
            income=lambda df: df.income.div(10),
            children=lambda df: df.children.astype("bool"),
            car=lambda df: df.car.str[0],
        )
        .groupby(
            [
                "name",
                "car",
            ]
        )
        .agg(
            born_min=("born", min),
            born_max=("born", max),
            income_mean=("income", np.mean),
            income_median=("income", np.median),
            children_mean=("children", np.mean),
            car_count=("car", "count"),
        )
    ),
    "Polars": lambda: (
        df_polars.lazy()
        .select(cs.all() - cs.ends_with("married"))
        .filter(
            (pl.col("name") == "David")
            & (pl.col("born") > 1980)
            & (pl.col("city") != "London")
            & (pl.col("children") >= 3)
        )
        .with_columns(
            [
                (pl.col("born") / 100).round(),
                pl.col("name").str.to_lowercase(),
                pl.col("city").str.to_uppercase(),
                pl.col("zip_code") * 2,
                (pl.col("income") / 10),
                pl.col("children").cast(pl.Boolean),
                pl.col("car").str.slice(0, length=1),
            ]
        )
        .groupby(
            [
                "name",
                "car",
            ]
        )
        .agg(
            [
                pl.col("born").min().alias("born_min"),
                pl.col("born").max().alias("born_max"),
                pl.col("income").mean().alias("income_mean"),
                pl.col("income").median().alias("income_median"),
                pl.col("children").mean().alias("children_mean"),
                pl.col("car").count().alias("car_count"),
            ]
        )
        .collect()
    ),
}
whole_workflow_read_csv = run_test(test_dict, "Whole workflow")

Figure 12: Performing a representative workflow

Using `pl.scan_csv`

Code

test_dict = {
    "Pandas": lambda: (
        df_pandas.loc[:, lambda df: ~df.columns.isin(["is_married"])]
        .query("name=='David' and born>1980 and city != 'London' and children >= 3")
        .assign(
            born=lambda df: df.born.div(100).round(),
            name=lambda df: df.name.str.lower(),
            city=lambda df: df.city.str.upper(),
            zip_code=lambda df: df.zip_code.mul(2),
            income=lambda df: df.income.div(10),
            children=lambda df: df.children.astype("bool"),
            car=lambda df: df.car.str[0],
        )
        .groupby(
            [
                "name",
                "car",
            ]
        )
        .agg(
            born_min=("born", min),
            born_max=("born", max),
            income_mean=("income", np.mean),
            income_median=("income", np.median),
            children_mean=("children", np.mean),
            car_count=("car", "count"),
        )
    ),
    "Polars": lambda: (
        pl.scan_csv("sample.csv")
        .select(cs.all() - cs.ends_with("married"))
        .filter(
            (pl.col("name") == "David")
            & (pl.col("born") > 1980)
            & (pl.col("city") != "London")
            & (pl.col("children") >= 3)
        )
        .with_columns(
            [
                (pl.col("born") / 100).round(),
                pl.col("name").str.to_lowercase(),
                pl.col("city").str.to_uppercase(),
                pl.col("zip_code") * 2,
                (pl.col("income") / 10),
                pl.col("children").cast(pl.Boolean),
                pl.col("car").str.slice(0, length=1),
            ]
        )
        .groupby(
            [
                "name",
                "car",
            ]
        )
        .agg(
            [
                pl.col("born").min().alias("born_min"),
                pl.col("born").max().alias("born_max"),
                pl.col("income").mean().alias("income_mean"),
                pl.col("income").median().alias("income_median"),
                pl.col("children").mean().alias("children_mean"),
                pl.col("car").count().alias("car_count"),
            ]
        )
        .collect()
    ),
}
whole_workflow_scan_csv = run_test(test_dict, "Whole workflow (scan_csv)")

Figure 13: Performing a representative workflow (using pl_scan_csv for Polars)

As evident, the utilization of scan_csv increased the required time by about 3-4 times. However, even with this increase, Polars still manages to maintain a substantial advantage of around 5 times faster than the entire workflow executed using Pandas.

Note

When we consider the entirety of the data processing pipeline, irrespective of the file reading approach, Polars emerges as the victor. It consistently exhibits a considerable speed advantage compared to Pandas.

Putting it all together

Time to bring together all the things we’ve explored! Let’s sum up what we’ve learned.

Code

summary = pd.concat(
    [
        read_csv,
        read_parquet,
        # select_col_not_in_memory,
        select_col_in_memory,
        filter_row_one_condition,
        filter_row_multiple_condition,
        operate_one_column,
        operate_multiple_column,
        concatenate,
        aggregate_simple,
        aggregate_complex,
        whole_workflow_read_csv,
        # whole_workflow_scan_csv
    ],
    axis=1,
)

fig = (
    summary.melt(var_name="experiment", value_name="time_sec")
    .assign(
        package=lambda df: df.experiment.str[-6:],
        experiment=lambda df: df.experiment.str[:-7],
    )
    .groupby(["experiment", "package"])
    .time_sec.mean()
    .mul(1000)
    .reset_index()
    .pivot(columns="package", values="time_sec", index="experiment")
    .sort_values(
        by="polars",
    )
    .reset_index()
    .pipe(
        lambda df: px.bar(
            df,
            y="experiment",
            x=["pandas", "polars"],
            barmode="group",
            text_auto='.2s',
            orientation="h",
            labels={"value": "Average execution Time (msec)", "experiment": ""},
            title= f'<b>Speed Showdown in Data Workflow Operations</b> <br><sup><i>Pandas {pd.__version__} vs. Polars {pl.__version__}</i></sup>',
            width=1000,
            height=700,
        )
    )
)
fig.update_xaxes(tickangle=90)
fig.update_layout(yaxis=dict(automargin=True), legend_title_text="Library")
fig.update_traces(textposition='outside')

Figure 14: Analyzing Speed Differences Between Pandas and Polars in Common Data Science Workflow Operations

Throughout our exploration, Polars has consistently outperformed Pandas. We intentionally focused on tasks that frequently arise in data analysis, encompassing file reading, column selection, filtering, and more. However, where Polars truly shines is in aggregation, capitalizing on its multiprocessing prowess. Take a glance at Figure 14, and you’ll notice that reading files is the most time-consuming step for Polars. To address this (and reduce memory usage at the same time), we have the nifty scan_csv function that enables Polars to operate in a lazy mode, optimizing the entire data pipeline. It’s clear that Polars packs a punch in the realm of data processing!

Should you consider using Polars? Based on the findings of this experiment, it’s evident that Polars can bring substantial benefits to every stage of your data processing. Its remarkable speed can make a significant impact. If you’re open to investing time in understanding its API (which isn’t drastically different from Pandas), I’m confident your projects will gain a significant speed boost. And don’t forget, the trajectory of the project seems promising, implying a bright future ahead.

Whatever path you choose, ensure you become well-acquainted with your selected library and stay current with emerging technologies. And above all, relish the journey. Happy coding, and until next time! 🐼🐍🤓💻

--- title: 'The Ultimate Speed Test: Pandas vs Polars ' author: Adam Cseresznye date: '2023-08-13' categories: - Polars - Pandas jupyter: python3 toc: true format: html: code-fold: true code-tools: true --- #| fig-cap: Photo by Hans-Jurgen Mager on Unsplash #| label: fig-fig0 ![](hans-jurgen-mager-qQWV91TTBrE-unsplash.jpg){fig-align="center" width=50%} Once I shared my first article about ["Getting Started with Polars"](https://adamcseresznye.github.io/blog/posts/polars/Polars_revised.html) I started thinking about something exciting: comparing the speed of Polars and Pandas. This curiosity was sparked by all the buzz around the brand-new [Pandas 2.0 release](https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html), promising lightning-fast performance. Pandas 2.0 was announced to come packed with cool features, including the addition of Apache Arrow (pyarrow) as its backing memory format. The big perk of Apache Arrow is that it makes operations speedier and more memory-friendly. Naturally, this got me wondering: how does Pandas 2.0 measure up against Polars? Let's dive in and find out! ::: {.callout-warning} Keep in mind: occasionally, I refer to some performance difference figures. Even though I repeated these experiments 100 times each, because computations can have a bit of randomness, there might be slight variations in the exact numbers. ::: # Setup ```{python} import pandas as pd import polars as pl import polars.selectors as cs import numpy as np import timeit import random import plotly.io as pio import plotly.express as px pio.templates.default = "presentation" print(pd.__version__) print(pl.__version__) ``` ::: {.callout-note} Throughout this article, we'll be working with the latest versions of the packages: Pandas 2.0.3 and Polars 0.18.11. ::: # Generate the synthetic data For the purpose of benchmarking, we're going to create our own dataset. To ensure simplicity and uniformity, we'll generate a dataframe containing columns like name, year of birth, and city of residence – pretty straightforward stuff. We're aiming for a dataset of 10,000 rows. While we could certainly run this benchmark on a dataframe with more than a million rows, our intention is to provide a realistic example. The code we'll provide allows you to effortlessly explore various scenarios according to your requirements. Feel free to use it or modify it if you want to perform your own test. ```{python} def create_sample_dataframe(n_rows: int, name: str, seed: int = 42) -> pl.DataFrame: """ Create a sample Polars DataFrame with the specified number of rows. This function generates random data for several columns and creates a Polars DataFrame with the specified number of rows. The generated data includes information about names, birth years, cities, zip codes, incomes, marital status, number of children, and car brands. The resulting DataFrame is written to CSV and Parquet files named 'sample.csv' and '{name}.parquet', respectively. Parameters ---------- n_rows : int The number of rows to generate in the sample DataFrame. name : str The name to be used when writing the Parquet file. seed : int, optional The seed to use when generating random data. Default is 42. Returns ------- pl.DataFrame The first 5 rows of the generated sample DataFrame. """ random.seed(seed) np.random.seed(seed) data = { "name": np.random.choice(["Alice", "Bob", "Charlie", "David", "Eva"], n_rows), "born": np.random.randint(1950, 1990, size=n_rows), "city": np.random.choice( ["Shanghai", "San Francisco", "London", "Munich", "Mumbai"], n_rows ), "zip_code": np.random.randint(10000, 99999, size=n_rows), "income": np.random.normal(50000, 10000, size=n_rows), "is_married": np.random.choice([True, False], n_rows), "children": np.random.randint(0, 5, size=n_rows), "car": np.random.choice(["Ford", "BMW", "Toyota", "Bentley", "Mini"], n_rows), } temp_df = pl.DataFrame(data) temp_df.write_csv("sample.csv") temp_df.write_parquet(f"{name}.parquet") return temp_df.head() create_sample_dataframe(n_rows=10_000, name="sample", seed=1) ``` # Define helper class In order to streamline our code, we'll create a convenient helper class featuring two essential methods: `get_time` and `get_figure`. The `get_time` method leverages the `timeit.repeat` function to accurately gauge the execution time of our code. In this context, we'll set the parameters `number = 1` and `repeat = 100`. This configuration runs the operations 100 times, allowing us to derive key statistics like the mean, median, and standard deviation. On the other hand, the `get_figure` method takes these gathered results and generates a visually appealing image using Plotly. To bring it all together and keep things simple, we've designed the `run_test` function. This function orchestrates the entire process, seamlessly merging the steps outlined above. ```{python} class Timeit: """ A class for measuring the execution time of multiple functions. This class provides methods for measuring the execution time of multiple functions using the `timeit` module. The functions to be tested are passed to the constructor as a dictionary, where the keys are the names of the functions and the values are the functions themselves. The `number` and `repeat` parameters control how many times each function is executed and how many times the timing is repeated, respectively. The `get_time` method measures the execution time of each function and stores the results in a dictionary, where the keys are the names of the functions and the values are lists of execution times. The `get_figure` method generates a box plot of the execution times using the `plotly.express.box` function. Parameters ---------- func_dict : dict A dictionary of functions to test, where the keys are the names of the functions and the values are the functions themselves. number : int, optional The number of times to execute each function in each timing run (default is 1). repeat : int, optional The number of times to repeat each timing run (default is 20). Attributes ---------- results : dict or None A dictionary of execution times for each function, where the keys are the names of the functions and the values are lists of execution times. This attribute is `None` until `get_time` is called. Methods ------- get_time() Measure the execution time of each function and store the results in the `results` attribute. get_figure() Generate a box plot of the execution times using `plotly.express.box`. """ def __init__(self, func_dict, number=1, repeat=100): self.func_dict = func_dict self.number = number self.repeat = repeat self.results = None def get_time(self): """ Measure the execution time of each function and store the results. This method uses the `timeit.repeat` function to measure the execution time of each function in `func_dict`. The results are stored in a dictionary, where the keys are the names of the functions and the values are lists of execution times. This dictionary is also stored in the `results` attribute of the `Timeit` object. Returns ------- dict A dictionary of execution times for each function. """ results = {} for k, v in self.func_dict.items(): results[k] = timeit.repeat(v, number=self.number, repeat=self.repeat) self.results = results return results def get_figure(self): """ Generate a box plot of the execution times. This method uses `plotly.express.box` to generate a box plot of the execution times stored in `results`.If `results` is `None`, this method raises a `ValueError`. Returns ------- plotly.graph_objs.Figure A box plot figure object. Raises ------ ValueError If no results are available (i.e., if `get_time` has not been called). """ if self.results is None: raise ValueError("No results available. Please run get_time first.") df = pd.DataFrame(self.results) mean_times = df.median() * 1000 title = f'Median Execution Times: {", ".join([f"{k}={v:.6f}ms" for k,v in mean_times.items()])}' return px.box( df, points="all", labels={"value": "time (sec)", "variable": ""}, title=title, width=600, height=500, ) def run_test(test: dict, name: str): """ Run a timing test on a dictionary of functions using the Timeit class and return the results as a DataFrame. This function takes a dictionary of functions as input, where the keys are the names of the functions and the values are the functions themselves. It creates a Timeit object with this dictionary and uses it to measure the execution time of each function. The results are then displayed as a box plot using the `get_figure` method of the Timeit object and returned as a Pandas DataFrame with columns renamed according to the `name` argument. Parameters ---------- test : dict A dictionary of functions to test, where the keys are the names of the functions and the values are the functions themselves. name : str A string used to rename the columns of the returned DataFrame. Returns ------- pd.DataFrame A DataFrame containing the results of the timing test, with columns renamed according to the `name` argument. """ # Create a Timeit object t = Timeit(test) # Measure the execution time of the functions results = t.get_time() # Generate a box plot of the execution times fig = t.get_figure() fig.show() return pd.DataFrame(results).rename( columns={"Pandas": f"{name}_pandas", "Polars": f"{name}_polars"} ) ``` # Reading in data From a CSV File and parquet ## CSV Reading CSV files from disk is a task that data scientists often find themselves doing. Now, let's see how these two libraries compare for this particular job. To maximize the blazing-fast data handling capabilities of PyArrow, we'll equip Pandas with the `engine="pyarrow"` and `dtype_backend="pyarrow"` arguments. Let's see how these choices shape the performance! ```{python} #| fig-cap: Reading in data From a CSV File #| label: fig-fig1 test_dict = { "Pandas": lambda: pd.read_csv( "sample.csv", engine="pyarrow", dtype_backend="pyarrow" ), "Polars": lambda: pl.read_csv("sample.csv"), } read_csv = run_test(test_dict, "Read csv") ``` For the sake of comparison, we'll also demonstrate the timeit function invoked using Jupyter cell magic. You'll notice that the numbers generated this way are quite closely aligned with ours. ```{python} #| tags: [] %%timeit pd.read_csv("sample.csv", engine="pyarrow", dtype_backend="pyarrow") ``` ```{python} #| tags: [] %%timeit pl.read_csv("sample.csv") ``` ## parquet Now, let's read the data in Parquet format. ```{python} #| fig-cap: Reading in data From a parquet File #| label: fig-fig2 #| tags: [] test_dict = { "Pandas": lambda: pd.read_parquet( "sample.parquet", engine="pyarrow", dtype_backend="pyarrow" ), "Polars": lambda: pl.read_parquet("sample.parquet"), } read_parquet = run_test(test_dict, "Read parquet") ``` ::: {.callout-note} Polars unquestionably wins this round, it can boast a speed advantage of 2 to 4 times over Pandas. ::: # Selecting Columns Alright, let's spice things up a bit and select some columns to see which library races ahead in terms of speed! ## Files awaiting in-memory reading A clever approach to conserve memory and enhance speed involves reading only the columns essential for operations. Consider a scenario where we're interested in displaying just the names from this dataset. The big question now: how do these libraries measure up in terms of speed? Let's find out! ```{python} #| fig-cap: Selecting Columns from a File Not Yet in Memory #| label: fig-fig3 test_dict = { "Pandas": lambda: pd.read_csv( "sample.csv", engine="pyarrow", dtype_backend="pyarrow", usecols=["name"] ), "Polars": lambda: pl.scan_csv("sample.csv").select(pl.col("name")).collect(), } select_col_not_in_memory = run_test(test_dict, "Select column (not in memory)") ``` ## File is in memory As anticipated, Polars continues to showcase its swiftness. It's worth highlighting the usage of the `lazy` and `collect` methods in Polars. These nifty tools grant us access to the library's clever query optimization techniques, which play a pivotal role in significantly enhancing performance. OK, one step further: suppose our files are already loaded into memory. Would there still be a distinction in performance under this circumstance? ```{python} #| tags: [] df_pandas = pd.read_csv("sample.csv", engine="pyarrow", dtype_backend="pyarrow") df_polars = pl.read_csv("sample.csv") ``` ```{python} #| fig-cap: Selecting Columns from a File Already in Memory #| label: fig-fig4 #| tags: [] test_dict = { "Pandas": lambda: df_pandas.loc[:, "name"], "Polars": lambda: df_polars.lazy().select(pl.col("name")).collect(), } select_col_in_memory = run_test(test_dict, "Select column") ``` ::: {.callout-note} While Polars showed a significant speed advantage for tasks involving pre-read files, both libraries perform similarly when the files are already in memory. ::: # Filtering Rows Now, let's explore the scenario where we filter our dataset based on one or more column values ## Based on one condition For our simple scenario, we'll be narrowing down our focus to filter data based on individuals with the name "David". ```{python} #| fig-cap: Filtering Rows Based on One Condition #| label: fig-fig5 test_dict = { "Pandas": lambda: (df_pandas.query("name=='David'")), "Polars": lambda: (df_polars.lazy().filter((pl.col("name") == "David")).collect()), } filter_row_one_condition = run_test(test_dict, "Filter (simple)") ``` ## Based on multiple conditions Now, for a more intricate challenge, we're going to dive into querying the data to extract individuals who meet specific criteria: those named David, born after 1980, residing in a city other than London, married, and with three children. ```{python} #| fig-cap: Filtering Rows Based on Multiple Condition #| label: fig-fig6 test_dict = { "Pandas": lambda: ( df_pandas.query( "name=='David' and born>1980 and city != 'London' or is_married == True and children >= 3" ) ), "Polars": lambda: ( df_polars.lazy() .filter( (pl.col("name") == "David") & (pl.col("born") > 1980) & (pl.col("city") != "London") | (pl.col("is_married") == True) & (pl.col("children") >= 3) ) .collect() ), } filter_row_multiple_condition = run_test(test_dict, "Filter (complex)") ``` ::: {.callout-note} Both libraries tackled this challenge quite well, yet Pandas struggled to keep pace with Polars. It's intriguing to observe that while Pandas required nearly twice the time for the more intricate task, Polars managed to complete it in almost the same amount of time. Parallelization in action. ::: # Performing operations on columns Now, let's roll up our sleeves and dive into performing some operations on the columns. ## Single operation As a single operation, we'll simply calculate the century in which these individuals were born. ```{python} #| fig-cap: Performing a Singme Operation on a Column #| label: fig-fig7 test_dict = { "Pandas": lambda: (df_pandas.assign(born=lambda df: df.born.div(100).round())), "Polars": lambda: ( df_polars.lazy().with_columns((pl.col("born") / 100).round()).collect() ), } operate_one_column = run_test(test_dict, "Operate (one column)") ``` ## Multiple operations Let's also explore what happens when performing multiple operations on the columns. We'll mix things up with some string operations, mapping, and math calculations to see how these libraries handle it! ```{python} #| fig-cap: Perfrming a Multiple Operation on Columns #| label: fig-fig8 test_dict = { "Pandas": lambda: ( df_pandas.assign( born=lambda df: df.born.div(100).round(), name=lambda df: df.name.str.lower(), city=lambda df: df.city.str.upper(), zip_code=lambda df: df.zip_code.mul(2), income=lambda df: df.income.div(10).astype("str"), is_married=lambda df: df.is_married.map({False: 0, True: 1}), children=lambda df: df.children.astype("bool"), car=lambda df: df.car.str[0], ) ), "Polars": lambda: ( df_polars.lazy() .with_columns( [ (pl.col("born") / 100).round(), pl.col("name").str.to_lowercase(), pl.col("city").str.to_uppercase(), pl.col("zip_code") * 2, (pl.col("income") / 10).cast(pl.Utf8), pl.col("is_married").map_dict({False: 0, True: 1}), pl.col("children").cast(pl.Boolean), pl.col("car").str.slice(0, length=1), ] ) .collect() ), } operate_multiple_column = run_test(test_dict, "Operate (more columns)") ``` ::: {.callout-note} Once again, Polars takes the lead. While both libraries required more time for the task involving multiple operations, Polars demonstrated superior scalability in this scenario. ::: # Concatenating Data Now, let's turn our attention to concatenating two datasets. Let the merging begin! ```{python} df_pandas2 = df_pandas.copy(deep=True) df_polars2 = df_polars.clone() ``` ```{python} #| fig-cap: Concatenating Dataframes #| label: fig-fig9 test_dict = { "Pandas": lambda: pd.concat([df_pandas, df_pandas2], axis=0), "Polars": lambda: pl.concat([df_polars, df_polars2], how="vertical"), } concatenate = run_test(test_dict, "Concatenate") ``` ::: {.callout-note} Once more, Polars shines with a remarkable speed advantage. ::: # Aggregation ## Simple Time to shift our attention to aggregation. First up, a simple task: let's calculate the mean income based on names. Then, for a bit more complexity, we'll dive into computing statistics involving the income, children, and car columns. Things are about to get interesting! ```{python} #| fig-cap: Performing a simple aggregation #| label: fig-fig10 test_dict = { "Pandas": lambda: (df_pandas.groupby("name").income.mean()), "Polars": lambda: (df_polars.lazy().groupby("name").mean().collect()), } aggregate_simple = run_test(test_dict, "Aggregate (simple)") ``` ## More complex ```{python} #| fig-cap: Performing a complex aggregation #| label: fig-fig11 test_dict = { "Pandas": lambda: ( df_pandas.groupby(["name", "car", "is_married"]).agg( born_min=("born", min), born_max=("born", max), income_mean=("income", np.mean), income_median=("income", np.median), children_mean=("children", np.mean), car_count=("car", "count"), ) ), "Polars": lambda: ( df_polars.lazy() .groupby(["name", "car", "is_married"]) .agg( [ pl.col("born").min().alias("born_min"), pl.col("born").max().alias("born_max"), pl.col("income").mean().alias("income_mean"), pl.col("income").median().alias("income_median"), pl.col("children").mean().alias("children_mean"), pl.col("car").count().alias("car_count"), ] ) .collect() ), } aggregate_complex = run_test(test_dict, "Aggregate (complex)") ``` ::: {.callout-note} While Pandas showcased noteworthy speed for the simple aggregation, the more intricate task exposed significant disparities between the two libraries. Polars took a commanding lead in this scenario, presenting a considerably faster performance compared to Pandas. ::: # Whole workflow We're going to tackle this in two versions. First, we'll use `pl.read_csv` for Polars, assessing the time it takes to complete the entire workflow on a DataFrame already in memory. Then, for the second version, we'll employ `pl.scan_csv` for Polars. This nifty function lets us lazily read data from a CSV file or multiple files using glob patterns. The cool part? With `pl.scan_csv`, the query optimizer can push down predicates and projections to the scan level. This nifty move has the potential to cut down on memory overhead. Let's dive into both versions and see how they stack up! ## Using `pl.read_csv` ```{python} #| fig-cap: Performing a representative workflow #| label: fig-fig12 test_dict = { "Pandas": lambda: ( df_pandas.loc[:, lambda df: ~df.columns.isin(["is_married"])] .query("name=='David' and born>1980 and city != 'London' and children >= 3") .assign( born=lambda df: df.born.div(100).round(), name=lambda df: df.name.str.lower(), city=lambda df: df.city.str.upper(), zip_code=lambda df: df.zip_code.mul(2), income=lambda df: df.income.div(10), children=lambda df: df.children.astype("bool"), car=lambda df: df.car.str[0], ) .groupby( [ "name", "car", ] ) .agg( born_min=("born", min), born_max=("born", max), income_mean=("income", np.mean), income_median=("income", np.median), children_mean=("children", np.mean), car_count=("car", "count"), ) ), "Polars": lambda: ( df_polars.lazy() .select(cs.all() - cs.ends_with("married")) .filter( (pl.col("name") == "David") & (pl.col("born") > 1980) & (pl.col("city") != "London") & (pl.col("children") >= 3) ) .with_columns( [ (pl.col("born") / 100).round(), pl.col("name").str.to_lowercase(), pl.col("city").str.to_uppercase(), pl.col("zip_code") * 2, (pl.col("income") / 10), pl.col("children").cast(pl.Boolean), pl.col("car").str.slice(0, length=1), ] ) .groupby( [ "name", "car", ] ) .agg( [ pl.col("born").min().alias("born_min"), pl.col("born").max().alias("born_max"), pl.col("income").mean().alias("income_mean"), pl.col("income").median().alias("income_median"), pl.col("children").mean().alias("children_mean"), pl.col("car").count().alias("car_count"), ] ) .collect() ), } whole_workflow_read_csv = run_test(test_dict, "Whole workflow") ``` ## Using `pl.scan_csv` ```{python} #| fig-cap: Performing a representative workflow (using `pl_scan_csv` for Polars) #| label: fig-fig13 test_dict = { "Pandas": lambda: ( df_pandas.loc[:, lambda df: ~df.columns.isin(["is_married"])] .query("name=='David' and born>1980 and city != 'London' and children >= 3") .assign( born=lambda df: df.born.div(100).round(), name=lambda df: df.name.str.lower(), city=lambda df: df.city.str.upper(), zip_code=lambda df: df.zip_code.mul(2), income=lambda df: df.income.div(10), children=lambda df: df.children.astype("bool"), car=lambda df: df.car.str[0], ) .groupby( [ "name", "car", ] ) .agg( born_min=("born", min), born_max=("born", max), income_mean=("income", np.mean), income_median=("income", np.median), children_mean=("children", np.mean), car_count=("car", "count"), ) ), "Polars": lambda: ( pl.scan_csv("sample.csv") .select(cs.all() - cs.ends_with("married")) .filter( (pl.col("name") == "David") & (pl.col("born") > 1980) & (pl.col("city") != "London") & (pl.col("children") >= 3) ) .with_columns( [ (pl.col("born") / 100).round(), pl.col("name").str.to_lowercase(), pl.col("city").str.to_uppercase(), pl.col("zip_code") * 2, (pl.col("income") / 10), pl.col("children").cast(pl.Boolean), pl.col("car").str.slice(0, length=1), ] ) .groupby( [ "name", "car", ] ) .agg( [ pl.col("born").min().alias("born_min"), pl.col("born").max().alias("born_max"), pl.col("income").mean().alias("income_mean"), pl.col("income").median().alias("income_median"), pl.col("children").mean().alias("children_mean"), pl.col("car").count().alias("car_count"), ] ) .collect() ), } whole_workflow_scan_csv = run_test(test_dict, "Whole workflow (scan_csv)") ``` As evident, the utilization of `scan_csv` increased the required time by about 3-4 times. However, even with this increase, Polars still manages to maintain a substantial advantage of around 5 times faster than the entire workflow executed using Pandas. ::: {.callout-note} When we consider the entirety of the data processing pipeline, irrespective of the file reading approach, Polars emerges as the victor. It consistently exhibits a considerable speed advantage compared to Pandas. ::: # Putting it all together Time to bring together all the things we've explored! Let's sum up what we've learned. ```{python} #| fig-cap: Analyzing Speed Differences Between Pandas and Polars in Common Data Science Workflow Operations #| label: fig-fig14 #| column: page summary = pd.concat( [ read_csv, read_parquet, # select_col_not_in_memory, select_col_in_memory, filter_row_one_condition, filter_row_multiple_condition, operate_one_column, operate_multiple_column, concatenate, aggregate_simple, aggregate_complex, whole_workflow_read_csv, # whole_workflow_scan_csv ], axis=1, ) fig = ( summary.melt(var_name="experiment", value_name="time_sec") .assign( package=lambda df: df.experiment.str[-6:], experiment=lambda df: df.experiment.str[:-7], ) .groupby(["experiment", "package"]) .time_sec.mean() .mul(1000) .reset_index() .pivot(columns="package", values="time_sec", index="experiment") .sort_values( by="polars", ) .reset_index() .pipe( lambda df: px.bar( df, y="experiment", x=["pandas", "polars"], barmode="group", text_auto='.2s', orientation="h", labels={"value": "Average execution Time (msec)", "experiment": ""}, title= f'Speed Showdown in Data Workflow Operations Pandas {pd.__version__} vs. Polars {pl.__version__}', width=1000, height=700, ) ) ) fig.update_xaxes(tickangle=90) fig.update_layout(yaxis=dict(automargin=True), legend_title_text="Library") fig.update_traces(textposition='outside') ``` Throughout our exploration, Polars has consistently outperformed Pandas. We intentionally focused on tasks that frequently arise in data analysis, encompassing file reading, column selection, filtering, and more. However, where Polars truly shines is in aggregation, capitalizing on its multiprocessing prowess. Take a glance at @fig-fig14, and you'll notice that reading files is the most time-consuming step for Polars. To address this (and reduce memory usage at the same time), we have the nifty `scan_csv` function that enables Polars to operate in a lazy mode, optimizing the entire data pipeline. It's clear that Polars packs a punch in the realm of data processing! Should you consider using Polars? Based on the findings of this experiment, it's evident that Polars can bring substantial benefits to every stage of your data processing. Its remarkable speed can make a significant impact. If you're open to investing time in understanding its API (which isn't drastically different from Pandas), I'm confident your projects will gain a significant speed boost. And don't forget, the [trajectory](https://www.crunchbase.com/funding_round/polars-seed--963c89d2) of the project seems promising, implying a bright future ahead. Whatever path you choose, ensure you become well-acquainted with your selected library and stay current with emerging technologies. And above all, relish the journey. Happy coding, and until next time! 🐼🐍🤓💻

Setup

Generate the synthetic data

Define helper class

Reading in data From a CSV File and parquet

CSV

parquet

Selecting Columns

Files awaiting in-memory reading

File is in memory

Filtering Rows

Based on one condition

Based on multiple conditions

Performing operations on columns

Single operation

Multiple operations

Concatenating Data

Aggregation

Simple

More complex

Whole workflow

Using pl.read_csv

Using pl.scan_csv

Putting it all together

Using `pl.read_csv`

Using `pl.scan_csv`